DATA CLEANING (SQL)
About this project
Through the Below SQL queries, the housing
data was thoroughly cleaned, standardized, and prepared for further analysis
or integration into other systems. The project demonstrates proficiency in data cleaning techniques using SQL
and
ensures the dataset's quality and consistency for downstream processes or analysis.
1 - Standardize Date Format:
Used SQL CONVERT function to standardize the date format in the SaleDate column.
Added a new column SaleDateConverted to store the standardized dates.
2 - Populate Property Address Data:
Filled missing property addresses by joining the dataset with itself based on ParcelID and copying the
non-null property
addresses.
3 - Break Address into Individual Columns:
Split the PropertyAddress column into separate columns for PropertyNumAddress, PropertyCity, and
PropertyState.
Utilized SQL string manipulation functions such as SUBSTRING and CHARINDEX to extract the required
information.
4 - Break Owner Address into Individual Columns:
Similar to the property address, split the OwnerAddress column into OwnerNumAddress, OwnerCity, and OwnerState
columns.
Used the PARSENAME function to parse the address components.
5 - Change Y and N to Yes and No:
Converted 'Y' and 'N' values in the SoldAsVacant column to 'Yes' and 'No', respectively, for better
readability.
Implemented using SQL UPDATE statements with conditional logic.
6 - Remove Duplicates:
Eliminated duplicate records based on specific columns (ParcelID, PropertyAddress, SalePrice, SaleDate,
LegalReference)
using a common table expression (CTE) and the ROW_NUMBER() window function.
Deleted duplicate records, keeping only one occurrence of each unique combination.
7 - Delete Unused Columns:
Removed unnecessary columns (SaleDate, PropertyAddress, OwnerAddress) from the dataset to streamline the data
and
improve clarity.
Employed the SQL ALTER TABLE statement with the DROP COLUMN clause to delete the specified columns.